Amendments to the Claims: 



1 (previously presented): A document descriptor determination method 
comprising the steps of: 

generalizing input sequences associated with a document to develop general 
sequences, said input sequences reflecting the structure of a document; 

factoring said input sequences and said general sequences to develop factored 
sequences; 

selecting a document descriptor from said input sequences, said general 
sequences, and said factored sequences using minimum descriptor length (MDL) 
principles. 

2. (original): The method of claim 1, wherein said selecting step comprises the 
steps of: 

encoding said input sequences, said general sequences, and said factored 
sequences; and 

selecting a document descriptor which encompasses all of said input sequences 
and exhibits a minimum MDL cost. 

3. (original): The method of claim 2, wherein said encoding step employs an 
algorithm which applies a set of rules comprising: 

seq(D.s) = e if D=s, if D does not contain metacharacters; 
seq(D 1 ...D kl Si...s k ) = seq(D 1 ,Si)...seq(D k) s k ); 
seq(D 1 |...|D m ,s) = iseq(D i ,s); 

seq(D*,Si...s k ) = {k seq(D,Si)...seq(D,s k ) if k>0; 0 otherwise}; 

wherein D is a sequence of symbols, s is a sequence, and i is an index of a 
regular expression that the corresponding sequence s matches, wherein log m bits are 
needed to encode index i. 



2 



4. (original): The method of claim 3, wherein said minimum MDL cost is 
determined by employing an algorithm to solve a facility location problem (FLP), said 
FLP modified to compute said minimum MDL cost of potential document descriptors. 

5. (original): The method of claim 4, wherein said document descriptor is a 
document type descriptor (DTD), and said document is an extensible Markup Language 
(XML) document. 

6. (original): The method of claim 5, wherein said minimum MDL cost comprises 
summing a first length of bits describing the DTD and a second length of bits for 
encoding the sequences. 

7. (previously presented): A document descriptor determination method 
comprising the steps of: 

generalizing input sequences to develop general sequences, said input 
sequences reflecting the structure of data within a document; 

selecting a document descriptor from said input sequences and said general 
sequences using minimum descriptor length (MDL) principles. 

8. (original): The method of claim 7, wherein said selecting step comprises the 
steps of: 

encoding said input sequences and said general sequences; and 
selecting a document descriptor which encompasses all of said input sequences 
and exhibits a minimum MDL cost. 

9. (original): The method of claim 8, wherein said encoding step employs an 
algorithms which applies a set of rules comprising: 

seq(D,s) = e if D=s, if D does not contain metacharacters; 

seq(Di...D k , Si...s k ) = seq(D 1) si)...seq(D k ,s k ) ) if D is a concatenation of Di...D k ; 

seq(Di|...|D mi s) = iseq(Dj,s); 

seq(D*,Si...s k ) = {k seq(D,Si)...seq(D,s k ) if k>0; 0 otherwise}; 
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wherein D is a sequence of symbols, s is a sequence, and i is an index of a 
regular expression that the corresponding sequence s matches, wherein log m bits are 
needed to encode index i. 

10. (original): The method of claim 9, wherein said minimum MDL cost is 
determined by employing an algorithm to solve a facility location problem (FLP), 
wherein said FLP is modified to compute said minimum MDL cost of potential document 
descriptors. 

11. (original): The method of claim 10, wherein said document descriptor is a 
document type descriptor (DTD), and said document is an extensible Markup Language 
(XML) document. 

12. (original): The method of claim 11, wherein said minimum MDL cost 
comprises summing a first length of bits describing the DTD and a second length of bits 
for encoding the sequences. 

13. (previously presented): The method of claim 7, further comprising the step 

of: 

factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available for said step of selecting. 

14. (original): A computer-readable medium encoded with a computer program 
for generalizing input sequences to develop general sequences, said computer program 
comprising: 

a discover OR patterns procedure; 

a discover sequence patterns procedure; and 

a generalize procedure which calls said discover sequence patterns procedure 
and calls said discover OR patterns procedure, wherein said discover OR patterns 
procedure is nested within said discover sequence patterns procedure. 
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15. (original): The computer-readable medium of claim 14, said computer 
program further comprising a partition procedure called by said discover OR patterns 
procedure. 

16. (original): A method for generalizing input sequences to develop general 
sequences comprising the steps of: 

discovering OR patterns among said input sequences; and 

discovering sequence patterns among said input sequences and OR patterns. 

17. (original): The method of claim 16, wherein said step of discovering OR 
patterns comprises the step of partitioning said input sequences. 

18. (currently amended): A document descriptor determination method 
comprising the steps of: 

generalizing input sequences, said generalizing step comprising the steps of: 

discovering OR patterns among said input sequences, and 

discovering sequence patterns among said input sequences and OR 

patterns; and 

selecting a document descriptor from said input sequences and said general 
sequences. 

19. (original): The method of claim 18, wherein said discovering OR patterns 
step comprises the step of partitioning said input sequences. 

20. (original): The method of claim 19, further comprising the steps of: 
factoring said input sequences and said general sequences to develop factored 

sequences, wherein said factored sequences are available to said step of selecting. 

21 . (original): The method of claim 20, wherein said step of selecting employs 
minimum descriptor length (MDL) principles. 
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22. (original): The method of claim 21, wherein said document descriptor is a 
document type descriptor (DTD) and said document is an extensible Markup Language 
(XML) document. 
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